arXiv:1506.05757vl [stat.ME] 18Jun2015 


Bayesian Inference for the Multivariate 
Extended-Skew Normal Distribution 


Mathieu Gerber* 


Florian Pelgrin^ 


Faculty of Business and Economics (HEC) 
University of Lausanne, Switzerland 

CREST 


EDHEC Business Sehool, France 

CIRANO 


The multivariate extended skew-normal distribution allows for accommod¬ 
ating raw data which are skewed and heavy tailed, and has at least three 
appealing statistical properties, namely closure under conditioning, affine 
transformations, and marginalization. In this paper we propose a Bayesian 
computational approach based on a sequential Monte Carlo (SMC) sampler 
to estimate such distributions. The practical implementation of each step 
of the algorithm is discussed and the elicitation of prior distributions takes 
into consideration some unusual behaviour of the likelihood function and 
the corresponding Fisher information matrix. Using Monte Carlo simula¬ 
tions, we provide strong evidence regarding the performances of the SMC 
sampler as well as some new insights regarding the parametrizations of the 
extended skew-normal distribution. A generalization to the extended skew- 
normal sample selection model is also presented. Finally we proceed with 
the analysis of two real datasets. 

Keywords: Bayesian estimation; Bayes factor; Sequential Monte Carlo; 
Skew-elliptical distributions 


‘Present address: Harvard University, Department of Statistics. Email: mathieuger- 

ber@fas.harvard.edu 
^ Email: florian.pelgrin@edhec.edu 


1 



1. Introduction 


Recent years have seen a growing interest for flexible parametric families of multivariate 
distributions that can accommodate both the skewness and the kurtosis often observed 
on data. This is especially important, e.g., in health, finance and environmental data, 
which are often skewed and heavy tailed. For instance, these two features of health (- 
care) expenditures have fundamental implications in topics related to risk adjustments, 


program and treatment evaluations, or insurance choices (Manning et ah, 2005). In 
this respect, the application of the skew-elliptical family of distributions (i.e., all non- 
symmetric distributions obtained from an elliptical distribution) has been put forward 
in the literature, and for good reasons. On the one hand, this family (or certain distribu¬ 
tions of this family) has at least three appealing properties: closure under conditioning, 
affine transformations, and marginalization. On the other hand, these parametric dis¬ 


tributions appear in the natural and important context of selection models (Heckman 


1976; Copas and Li, 1997; Arnold and Beaver, 2002). This last feature is particularly 


relevant in various research topics (e.g., economics, environmetrics or political sciences). 

Within the class of skew-elliptical distributions, the extended skew-normal (ESN) 
distribution appears in different areas of statistical theory, e.g., Bayesian statistics 
( jO’Hagan and Leonard 1976), regression analysis (Copas and Li 1997) or graphical 
models (Capitanio and Stanghellini, 2003). This generalization of the skew normal distri¬ 


bution, studied in the seminal paper of Azzalini (1985), has been developed and studied 


by Arnold et al. (1993); Arnold and Beaver (2000); Capitanio and Stanghellini (2003). 
Such a family of distributions may be obtained as a convolution between a multivariate 
normal random variable, Y, and a truncated standard normal random variable, Z, say 
Y=Y -|- dZ. The derivation and statistical properties of ESN distributions make it a 
natural candidate to model skewed and (leptokurtic, platikurtic) mesokurtic data gen¬ 


erating processes. However, as pointed out by Arellano-Valle et al. (2006), statistical 
inference of skew-elliptical distributions, and thus of the extended skew-normal family 
of distributions, is still mostly unsolved (even in the univariate case). 

Indeed, if Bayesian analysis of some families of skew-elliptical distributions has been 
proposed in the literature, they mainly focus on the skew-normal (or skew-t) distribution. 
In addition, Bayesian analysis of these distributions usually rely on objective prior-based 
methods, Gibbs sampling or population Monte Carlo. In particular, Liseo and Loperfid^ 
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(2006) consider a Bayesian estimation of the univariate skew-normal distribution based 


on objective priors whereas Wiper et al. (2008) analyse the half-normal and half-t cases, 


and Branco et al. (2013) focus on the skew-t distribution. Cabral et al. (2012) propose a 


full Bayesian estimation of a mixture of skew-normal densities while Friiwirth-Schnatter 


and Pyne (2010) provide a Gibbs sampler to estimate a mixture of skew-normal and skew- 


t densities. As an alternative to the Gibbs sampler, Liseo and Paris! (2013) advocate 


the application of a Population Monte Carlo (PMC) algorithm for missing data (Celeux 


et al., 2004) in order to sample from the posterior distribution of the skew-normal model. 


In this paper we propose a Bayesian computational method to estimate the extended 
(multivariate) skew-normal distribution. The Bayesian approach for this family of dis¬ 
tributions is motivated by some severe anomalies of the likelihood function and some 
identification issues. Notably, and as shown in this paper, the maximum likelihood es¬ 
timator may not be uniquely defined for univariate ESN distributions. In a Bayesian 
approach, these anomalies can be tackled to some extent by using a suitable elicitation of 
prior distributions. In the light of the properties of the likelihood function, we make use 


of a (tempered) sequential Monte Carlo sampler (Del Moral et al., 2006) rather than a 
Monte Carlo Makov chain (MCMC) algorithm. Briefly speaking, sequential Monte Carlo 
(SMC) iterates importance sampling steps, resampling steps and Markov kernel trans¬ 
itions in order to recursively approximate a sequence of distributions by making use of 
a sequence of weighted particle systems. Relative to MCMC algorithms, SMC might be 
called for at least three arguments. On the one hand, the great generality of SMC allows 
to build efficient algorithms in order to sample from complicated distributions and thus 
to overcome some distortions of the likelihood functions of the skew-normal distribution. 
On the other hand, compared to MCMC methods, it is easier to make SMC algorithms 
adaptive in the sense that they can be adjusted sequentially and automatically to the 
problem at hand, and the evidence or marginal likelihood of data can be derived formally. 
Finally, due to the convolution representation of the ESN distribution presented above, 
a natural idea to sample from the posterior distribution would be to implement a Gibbs 
sampler in which the hidden random variable Z is an extra parameter. However, in the 
case of ESN models, the support of the hidden variable Z depends on the parameters of 
interest and thus the posterior distribution in the augmented space does not satisfy the 


positivity condition (see Robert and Gasella, 2004, chapter 9). 

The rest of the paper is organized as follows. Section defines two equivalent rep- 
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resentations of extended skew-normal random vectors, review some useful properties of 
this class of distributions, and discuss some unpleasant features of the maximum likeli¬ 
hood function. Section [^proposes some prior distributions which take into consideration 
theses anomalies of the likelihood function and describes the proposed tempered sequen¬ 
tial Monte Carlo algorithm. Section presents some Monte Carlo simulations regarding 
the inference of univariate ESN distributions and of some regressions with missing data. 
Moreover we discuss the testing and model selection problems. Section deals with two 
applications, namely the distribution of transfer fees of soccer players in major European 


leagues and the bivariate distribution of two hnancial returns (Liseo and Parisi, 2013). 
The last section provides some concluding remarks. 


2. The extended skew-normal distribution 

In this section we first define the extended skew-normal (ESN) distribution using two 
different parametrizations. Then, we review some appealing properties of this class of 
distributions, especially in the light of the subsequent derivations of this paper. Ei- 
nally, we provide a new theoretical justification for the unsatisfactory behaviour of the 
maximum likelihood estimator of the ESN distribution. 

2.1. Definition and main properties 

We consider two parametrizations of the ESN distribution. The first parametrization, 
denoted PI, is based on hidden truncation (and/or selective reporting) using normal 
component densities whereas the second parametrization, denoted P2, rests on the con¬ 
volution of a multivariate normal distribution with a truncated standard normal variable. 

Definition 1. A random vector Y is said to have a d-dimensional extended skew-normal 
distribution, denoted Y ~ S, q. A), with covariance (correlation) matrix T,, 

shape parameter a, and shift parameter X, if 

Y% + Y,|A + „'Y,>Z.), (z;)~A^«((°).(o ")) (1) 

where Nd+i{pi', B) denotes the {d1)-dimensional Gaussian distribution with mean 
and variance-covariance matrix B. Its density function is defined to be: 

/y(y) = <('d(y,^,S) co = Vi + WSa (2) 
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where 4>d fJ-, B) is the density of the d-dimensional normal distribution with mean 
and covariance (correlation) matrix B and $(•) is the cumulative density function (cdf) 
of the J\fI {0,1) distribution. 

On the other hand, the ESN distribution can be defined from a convolution. 


Definition 2. A random vector Y is said to have a d-dimensional extended skew-normal 
distribution, denoted Y ssNf^\i,vt,d,c), if 

Y=% + dZs 

where —Z 3 ~ TAfc{0,l), the A/i( 0 , 1 ) distribution truncated to (—oo,c], and Y 3 ~ 
A/rf(0,0). Its probability density function is defined to be: 


fviy) = ((d (y,^,0 + dd') 


$ (^co|c + d'[n + dd'] ^(y-^ 


^(c) 


Several points are worth commenting. First, the ESN distribution belongs to the fam¬ 


ilies of skew-elliptical distributions proposed by Arnold and Beaver (2002), Dominguez- 


Molina et al. (2003), Fang (2003), and Arellano-Valle and Genton (2010). Alternatively, 


using P2, the ESN distribution belongs to the family of distributions proposed by Sahu| 


et al. (2003). Irrespective of the parametrization, the ESN distribution generalizes the 


multivariate skew-normal distribution (Azzalini and Dalla Valle, 1996) and thus the 


Gaussian distribution. More specifically, when the shift parameter A is set to zero, one 
obtains the (multivariate) SN distribution. On the other hand, the standard normal 
distribution results from the nullity of the shape parameter vector o:. As explained in 


Section 2.2, this constraint on the shape parameter vector has some key implications on 
inference. Indeed, the Fisher information matrix of the ESN (and of the SN) distribu¬ 
tion is singular, preventing a straightforward application of standard likelihood-based 
methods to test the null hypothesis of normality. The problem is even made worse by 
the parameter A, which indexes the distribution in the case of non-normality (nuisance 
parameter). 

Second, the choice of the parametrization might be critical for the estimation of ESN 
distributions since, in a Bayesian perspective, different parametrizations lead to altern¬ 
ative choices of prior distributions and thus different models (see Section]^. Third, one 
key feature of the ESN distribution over the SN distribution is that the former has an 
extra parameter that allows for a larger range of values for skewness and kurtosis and 
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thus for more flexibility to accommodate real data. For instance, using the moment 


generating function of Dominguez-Molina et al. (2003), one can provide evidence with 
a numerical analysis of the univariate ESN that the skewness coefficient is bounded by 
2 (in absolute value) while the kurtosis coefficient varies roughly between 2.75 and 7. 
In contrast, Azzalini ( |1985 ) points out that the skewness is smaller (in absolute value) 
than 0.995 and that the kurtosis lies between 3 and 3.87 in the case of the skew-normal 
distribution. 

Fourth, the ESN distribution has three familiar and useful properties, especially for 
regression-type models. It is closed under affine transformations, conditioning and mar¬ 
ginalization. On the one hand, ESN random vectors share the affine transformation of 
normal random vectors. In particular, let A denote an d x d non-singular matrix and 
I G Then, taking ([^, one obtains | -|- A'Y ~ + A'^,A'TiA,A~^a,\). 

On the other hand, if an ESN vector is partitioned into two components, the conditional 
distribution of one component given the other is extended skew-normal and each com¬ 
ponent is marginally extended skew-normal. For sake of completeness, Proposition 


due to Fang (2003) and Dominguez-Molina et al. (2003) reports the closure of the ESN 
distribution under conditioning and marginalization. 

Proposition 1. Assume that Y , S,q:,A). Partition Y, a and S as 

Y = (Yi, Y2)', e = (ei, £2)', a = (cri, 0:2)^ and S = ^ ^ where Yj, and cti are 

mi X 1 and 'Eu is mi x mi. Then, 


Yi ~ Eu, c^OLi, CiX), {Yi\Yj = y,) ~ A^) 

where a = {1 + a'jEi,iaj)~^/‘^, = ^. -^ EijEj^^{yj - $j), Eii,i = Eu - EijEj^^Eji, 

ai = cxi + and Xi = X + aj^yj - $j). 

Finally, the stochastic representation 0 of ESN random vectors leads to the following 
expression of the cumulative density function (henceforth, cdf): 

P(Y < y) = 

^ ^(A/co) 

with <I>rf+i(a, E, a, X) = P(Y 2 < a, Z 2 < A) and where 
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Notably, the evaluation of the cdf of the d-dimensional ESN distribution has the same 
complexity as the computation of the cdf of the (d+l)-dimensional Gaussian distribution, 
for which efficient methods are available (e.g., see |Huguenin etldT 2014). It turns to be 
very useful in practice since, for instance, the cdf of the ESN distribution arises naturally 
when deriving the expression of the likelihood function in the presence of missing data 


(see Section 4.2). 


2.2. Log-likelihood function 

Since our methodology rests on Bayesian estimation and thus on the posterior distri¬ 
bution associated to the ESN-based model, it is fundamental to study the statistical 
properties of the likelihood function. This might provide some useful insights in order to 
determine the prior distribution and thus challenge some identified anomalies regarding 
the likelihood function (i.e., to correct at least partially the odd behaviour of the likeli¬ 
hood function with external information). Eor sake of exposition, we concentrate on the 
univariate ESN distribution. 

Maximum likelihood estimation of ESN distributions is challenging and quite difficult 
to manage. More specifically, it is widely acknowledged that (i) there are no closed form 
expressions for the maximum likelihood estimator (MLE), (ii) the MLE of a can be 
infinite even in very simple settings, (iii) the multimodality of the log-likelihood profile 
(and thus local solutions) can not be ruled out and (iv) there exists an inflexion point 
at a = 0. In particular, the Eisher information matrix tends to be singular as a goes 
toward zero irrespective of the A parameter. Note that, in this case, ESN distributions 
are no longer indexed by the normal cumulative density functions and, consequently, 
the rank of the information matrix might be at least two less than its full rank. On the 
other hand, the presence of a stationary point (e.g., using the profile log-likelihood for 
the a parameter) and of multiple modes generally cause numerical issues. 

While these issues have been outlined in the literature, to the best of our knowledge, 
there is not yet a formal proof of the near unidentifiability of the log-likelihood function 
and the A parameter. Therefore, we show in Proposition!^ that the presence of the shift 
parameter A in PI might lead to local maxima for the maximum likelihood estimator of 
the univariate extended skew-normal distribution. Indeed, irrespective of the data and 
for all I G M, the ESN distribution admits a stationary point at q := (^„ q, 0^, Z), 
where ^n,G '^n,G are the MLE of ^ and S under the Gaussian assumption. In so 
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doing, if this stationary point is an inflexion point when we impose the A parameter to 
be zero ( Azzalini and Capitanio[ |1998[ ), the problem becomes even more severe when A 
is a free parameter as stated in Proposition!^ 

Proposition 2. Let Yi,...,Yn be n i.i.d. random variables, Yi 

with a^o. Let = {Cn,G, 0 with ^n,G = ^ Er=l ^l,G = ^ TJl=l - 

Let Ln{0) denote the log-likelihood function. Then, 

1. With probability one, there exists a T G M such that LniO^^Q) is a local maximum 
of Ln{-) for all I < 1*; 

2. With strictly positive probability, Ln{9^.^Q) = Ln{0n), I G K? where On is a 

global maximizer of Ln{-). 


See Appendix for a proof. 

The first result of Proposition!^ has an intuitive interpretation. When a = 0, the value 
of the log-likelihood function is insensitive to any change of the A parameter and thus any 
small deviation of a leads to large deviations from the true log-likelihood value (since 
the estimate A was initially far from its true unknown value). Consequently, a small 
deviation from 9^^ q in any direction reduces the value of the likelihood. The second part 
of Proposition!^ is more puzzling because it implies that, with a positive (but decreasing 
with n) probability, the likelihood function does not allow to discriminate between the 
Gaussian and the ESN model. This is a particularly severe anomaly of the likelihood 
function because it implies that the MLE might be not uniquely defined. 


3. Bayesian analysis of the ESN distribution 


In this section we first discuss the elicitation of prior distributions and then explain 
how to estimate the parameters of the two parameterizations of ESN distributions using 


Sequential Monte Carlo (Del Moral et ah, 2006). 


3.1. A default Prior specification 

In contrast to the standard approach of default prior distributions, and in the spirit of 
Gelman et al.| (2008), we propose a prior specification that embeds enough information 


to circumvent the anomalies of the log-likelihood function listed in Section 2.2 
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On the one hand, the (4, S, a, A)-parametrization (PI) of ESN random vectors must 
tackle two issues, namely the potential existence of multiple modes and the identification 
(estimation) of the truncation point, that are related to the identification of the A para¬ 
meter. First, the multi-modality of the log-likelihood function might be attenuated by 
setting a prior that assigns less weight on very negative values of A. Second, as argued 
in Section |2.2[ values of A such that the truncation point exceeds a certain threshold, 
say |A|/co > 2, are difficult to identify and therefore, both to avoid extreme estimates of 
A and to facilitate its identification in this region of the parameter space, it is important 
to choose a prior 7r(dA|Il,Q;) that puts small weights on {/ G M : |Z|/co > 2}. In so 
doing, we propose to consider a conditional normal prior distribution with mean zero 
and variance Cq, i.e. A|(S, a) ~ A/i(0, Cq). This naturally leads to a normal-inverse Wis- 
hart distribution as a prior for (4, S), which is the conjugate prior for Gaussian models 
(e.g., see jGelman et ah 2004). Note that it turns to ease the Bayesian model selection 
procedure (see Section 3.2.3). Hence, 


7 r(^,S|Q;) oc exp (^ - ^tr(ES ^) - ^(^ - ^o)'^ 


iy-\-d-\-2 

2 


(3) 


where V is a d x d positive definite matrix, k and v are real such that > d -|- 3. This 
last condition ensures that the mean of the prior distribution of S exists and that all 
its components has finite variance. Finally, one can choose a vague prior for a, e.g. 
a ~ AfdifJ-cti with cr^ large and Id the dx d identity matrix. In practice, it is 

likely to have information on the sign of a*, i = l,...,d, through information about 
the asymmetry of the full conditional distribution of Yi (see Proposition!^. This prior 
knowledge can be incorporated in the Bayesian analysis by taking / 0^. 

On the other hand, the (^, H, d, c)-parametrization shares the same issues as the PI- 
parametrization since c = A/cq. In addition, since an ESN random vector Y is defined 
by Y=4 -|- dZs + where —Z 3 ~ TJYciO, I) and Y 3 ~ J\fd{0d, Id), the convolu¬ 

tion representation of ESN random vectors leads to an additional identification problem 
that arises when H is “large” or “small” relative to d —more variability of one of the 
convolution-based density is obtained at the expense of weak identification of the other. 
In this respect, we assume that d|(^, H) '^hh Kd = 2{al{i> - d - 1)) ^ 

yielding, on average, the same variance for both d and o:. The choice of a Gaus¬ 
sian distribution for (djU) is motivated by the fact that, together with the assump¬ 
tion that the prior distribution of (^, H) is the normal-inverse Wishart distribution 
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7 r(^, S|^g, K, i>, y), we can easily implement a Gibbs sampler when c is known (Sahu 


et ah, 2003). Say differently, the Gaussian prior for d and the normal-inverse Wishart 


prior for (^, fl) are some natural candidates for the SN distribution of Sahu et al. (2003). 
Since 2 — 0 = Sqq'S/cq, which is a positive definite matrix, we choose {i>, V) such that 
the inverse Wishart distribution yV~^{V,v) gives more weight to “small” values than 
the yV~^{y,v) distribution. This can be done by taking V such that V — V is positive 
definite and v > v. In this case, the difference between the mode under (i>, V^) and the 
mode under [u, V) is negative definite. This also holds for the mean. 


3.2. A SMC sampler for multivariate ESN distributions 
3.2.1. General description 

Let 9 be the vector whose components are the parameters of the model (either under PI 
or under P2), /(zi:„|0) be the likelihood function, where zi-n = (zi, ...,Zn) is the set of 
observations, and tt{6 ) be the prior distribution of the parameters, which is either 7rpi(0) 
under PI or 7rp2(0) under P2. Using these notations, the posterior distribution we want 
to sample from is given by: 


7r(6'|zi:„) oc f{zi.,n\9)7r{9). 


As pointed out by Del Moral et al. (2006), sequential Monte Carlo samplers are relevant 


when there is no fully-eligible proposal distribution, say in order to implement the 

importance sampler. The SMC sampler requires to define a sequence of distributions 
{7rt{9)}f^i such that (i) itt{9) = 7r(0|zi:„) and (ii) 7ri(0) = rii{6), with rji a distribution 
we can easily sample from. This sequence of intermediary distributions is purely instru¬ 
mental and could be defined by making use of an appropriate real sequence of so-called 


temperatures {pt}t=i, increasing from zero to one. Following Gelman and Meng (1998), 
Neal (2001) and [Del Moral et al. (2006), we consider the geometric bridge: 


7rt{9) :oc r]i{e)^ ^‘7r(6'|zi:n) 




The basic idea of a SMC algorithm is first to sample > 1 particles from the initial 
distribution tti in order to obtain a Monte Carlo approximation tt^ = N~^ Ylm=i ^9™ 

TTi. Then, using resampling and propagation steps, SMC uses the approximation of 
TTi to construct vr^ = J2m=i a Monte Carlo approximation of 7r2. Informally, if 
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TT^ is a good approximation of tti and if 7r2 is close to tti, then one may expect vr^ to 
be close to 7r2, and so on. 

More precisely, suppose that, for a t G {2,..., T}, one has at hand a sample 
such that: 

1 ^ 

- 60^(dO) « n{e)de. 

m=l 

Then, we can approximate iit+i by the empirical distribution 

N 

m=l 

where the corresponding importance functions are defined to be: 


wr+i{p) = 


wt+i{eT,p) 

T.U^t+M.p) 


wt+i{e,p) = 


7r(6l|zi,„ 
Pi{&) . 


p-pt 


Note that pt+i — Pt measures the step length at time t + 1 so that, the larger the 
difference, the more the accuracy of the importance weighting worsens. To control 
such a degeneracy, we consider a procedure to determine a suitable sequence of 
through the effective sample size criterion. More specifically, instead of regarding T and 
the set as parameters of the algorithm, we view them as self-tuning parameters 


using the method proposed by Schafer and Chopin (2013). Given a value of pt and a 
sample {6T'\m=i approximates vr*, we compute the largest value of p G {pt, 1] such 
that the particle system once being properly weighted, allows to approximate 

“reasonably well” the probability distribution Tip oc through the effective sample 

size criterion (Liu and Chen 1995): 


ESSi(p) = 


N 


E "Tip? 


jn=l 


-1 


where, by definition, W^{p) is the weight assigned to 0™ to target TTp. If the effective 
sample size equals N, the interpretation is that the weights are equally balanced and 
that all N particles are equally contributing to the estimation. Then, pt+i is defined as 
the minimum between 1 and p*j^i with: 


p*t+i = sup {p> Pt - ESSt(p) > 13} 


where /? is a pre-specified threshold, say /3 = N/2. The fixed value p}j^i can be obtained 


by solving the equation ESS 4 (p) = (3 using the bi-sectional search algorithm of Schafer 
and Chopin (2013) (see Algorithm]^. 
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Algorithm 1 Tempering Sequential Monte Carlo Sampler 
Operations must be performed for all m = 1,. .., iV. 

Initialization 

Set t = 2 and pi = 0. 

Generate 0™ ~ ??i(d0) and compute W^{pi). 
while pt-i < 1 do 

Compute pt using Algorithmwith inputs pt-i and {0^i}m=i- 
Resampling: Generate a^i = where ^ ut ((0,1)) and 

N 

Ft,N{i) = E < i). 

m=l 

Propagation: Generate ~ ,d9). 

Set t i — t T 1 • 

end while 


At every pt, a resampling step, using the systematic resampling method of Carpent^ 


et al. (1999), is first performed in order to suppress particles that are in the region of 


the parameter space that receives very little mass from vrt. Say differently, the particles 
with the largest weights have multiplied whereas those with the smallest weights have 
vanished after the resampling step. Then, to restore particle diversity, new particles are 
generated from a Markov Kernel {9', d9) with invariant distribution vr* (see further). 

The complete procedure is summarized in Algorithm Any operation involving the 
superscript m (respectively, subscript t) must be understood as performed for m G 1 : A 
(respectively, t G 0 : T) where N (respectively, T) is the total number of particles 
(respectively, number of iterations). Note that n denotes the sample size. In addition, 
the procedure to find the step length is described in Algorithmic 


3.2.2. Implementation 

In our implementation we follow the usual approach and take for {9', d0) the Markov 
kernel that corresponds to r steps of the Gaussian random-walk Metropolis-Hastings 
algorithm with variance-covariance matrix given by ■ The constant > 0 is a 
scale factor such that the acceptance rate of the kernel lies in the range [0.2,0.6] while 
is a particle-based estimation of the variance-covariance matrix that corresponds to 
the distribution vr^. 

The initial distribution pi is another critical element for the speed of convergence 
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Algorithm 2 Find step length using Schafer and Chopin (2013) 


Input: e, p, 


l^0,u^ 1.05, 5 ^ 0.05. 


while \u — l\ > e and I < 1 — p do 


if fem=i + <5)^1 ^ < ^/2 then 

u <— ( i , 5 •(— ((^ + /)/2 


else 


/ •<— (5, <5 •(— ((5 + u)/2 

end if 
end while 

Return min(p + a, 1). 


of the algorithm and for the precision of the estimates. The first obvious option is to 
take the prior distribution, so that the SMC sampler moves simulations from the prior 
to simulations from the posterior distribution. Nevertheless, starting the SMC sampler 
with simulations from the prior can lead to a very low convergence rate of the algorithm 
and some large Monte-Carlo errors since there is no reason for the prior to be close to 
the posterior distribution. A better approach consists in initializing the sampler with an 
approximation of the target distribution from which we can easily sample. When one can 
maximize the posterior distribution, this is effectively done by a Laplace approximation. 
In this case, pi would be a normal distribution with mean mi and covariance matrix Si, 
where mi is set to the posterior mode and Si is equal to minus the inverse of the Hessian 
matrix evaluated at the posterior mode. In some settings (see Section |^, the numerical 
maximization of the posterior distribution might be particularly troublesome. In this 
case, we use a pilot run of a Gaussian random walk Metropolis-Hastings algorithm to 
get an estimate rh of the posterior mean and an estimate S of the posterior covariance 
matrix, and we define pi as the a normal distribution with mean rh and covariance 
matrix S. 

3.2.3. Discussion 

Using Algorithmic one can obtain estimates of the target distributions and the normal¬ 
izing constants directly from the variables generated by the sampler. Indeed, at the end 
of iteration T, an approximation of the target distribution 7r(0\zi:n) is given by: 



m=l 
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Moreover an estimate of the normalizing constant Zt of the posterior distribution 
can be obtained as follows. Let Zt be the normalizing constant of vr*. Then, we can 


estimate of Z^lZi by (Del Moral et ah, 2006): 


Zj' 


n 

t=2 


Z,- 


t-l 


Zt- 


N 


t-1 




m=l 


T^{0T-lW.r 

m{eT-i) 


pt-pt-i 


t > 2. 


A question of particular interest is whether the SN or the Gaussian distributions are 
more appropriate than the ESN distribution. In a Bayesian framework, the answer to 
this question is obtained by comparing the evidence, or marginal likelihood of the data, 
between the competing models. More specifically, consider the general test Hq : 6 = 9^ 
against Hi : 6 ^ 9^, where 9^ is the vector of parameters under the null hypothesis. In 
this respect, we make use of the Bayes factor defined by: 

_ mi(zi,„) 
mo{zi.,n) 


where zi:„ is the observations and where mi{zi-n) = / /*(zi:n|^)'?i'i(d0) is the evidence of 
model i £ {0,1}, with /i(zi:„|0) and TTi{d9) the corresponding likelihood and the prior 
distribution. 


It is well known (Morin et ah, 2013) that, if the competing models i £ {0,1} are 
regular, then the Bayes factor is a consistent criterion to discriminate between Hi and Hq. 


However, and as discussed in Section 2.2 if we wrongly assume that data are generated 
by some ESN distributions when the true underlying model is Gaussian, then the Fisher 
information matrix is singular and therefore there is no theoretical guarantee that the 
Bayes factor selects asymptotically the true model. We leave this issue for further 
research and rather assess the Bayes factor reliability through Monte Garlo simulations 
in Section m 

At this stage it is worth noting that testing the ESN distribution against the Gaus¬ 
sian model is straightforward. Indeed, the evidence under the ESN distribution can be 
directly obtained as a by-product of Algorithm [T| as explained above, while that under 
the Gaussian distribution can be computed explicitly thanks to the Gaussian conjugate 
prior @ for ^ and S (see e.g. [Gelman et al.[[^004[): 


mo{zi.,n) 


1 

7 rnd /2 Yd{l2/2) |I4|^"/2 \Kn) 
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where 


n 

Kn = K + n, Z/„ = l/ + n, Vn = V+ y^izi-Zn)izi-ZnY-\ -^-(z„ - 4°)(z„ - 

K + n 

1=1 

Finally, note that one key feature of SMC algorithms is their flexibility. Indeed, 
the implementation of Algorithm only requires to be able to evaluate the likelihood 
function. The Bayesian methodology developed in this section can therefore be easily 
modified to carry out parameter inference in (complicated) parametric models based 
on the ESN distribution. This point is illustrated in Section |4.2| where we apply the 
proposed methodology on an ESN sample selection model. 

4. Numerical study 

In this section we provide some Monte Carlo simulations in order to assess the per¬ 
formances of the proposed Bayesian approach and the behaviour of the posterior distri¬ 
bution. We consider two main data generating processes: (I) IID univariate extended 
skew-normal random variables, and (2) an extended skew-normal sample selection model 
(ESNSM). Eor the IID setting, the SMC sampler is initialized with a Laplace approx¬ 
imation of the posterior distribution while, for the ESNSM, the maximization of the 
posterior distribution turns out to be too sensitive to the choice of initial values in order 
to be useful in the construction of a good approximation of the posterior distribution. 
In that case, and as described above, we calibrate the initial distribution of the sampler 
using 10 000 iterations of a pilot Metropolis-Hastings algorithm. Finally, in all of the sim¬ 
ulations presented below, the propagation step of the tempered sequential Monte Carlo 
algorithm is based on r = 3 iterations of the Gaussian random walk Metropolis-Hastings 
kernel described in Section 13.21 

4.1. Example 1: IID univariate ESN random variables 

We first consider a sample of IID ESN random variables. To study the implications of 
the parametrization of ESN distributions, we use two data generating processes: 

Zi,--- ,Z„~T5WP(2,6,5,-2) (4) 

and 

Zi,-- - ,Z„~T5AAp^(2,l,5,-0.8) (5) 
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where the sample size n is successively 1000, 5 000, and 10 000. The variance, skewness 
and kurtosis of the first ESN distribution Q are respectively given by 2, 1, 4 whereas 
those of the second ESN distribution are respectively given by 6.60, 0.99, and 4.28. 
Eor all of the simulations, the parameters for the prior distributions, defined in Section 


V = 2Id, D = u, K = K and = 10. Einally, the tempered sequential Monte 

Carlo algorithm makes use of 10 000 particles. 


3.1, are set as follows: k = 0.1, = 0, i/ = max(6,d + A), V = 12/^, 


4.1.1. Parameters estimation 

Tables and [^report respectively the results for the two ESMi distributions Q and ([^ 
when the sample size is 1000 and 5 000. Several points are worth commenting. Eirst, as 
to be expected, the parametrization matters irrespective of the posterior statistics criteria 
used to compare the overall fitting (posterior mean, posterior median or posterior mode) 
and of the sample size. More specifically, when the true model is defined from the hidden 
truncation-based representation (PI), the posterior mean, median and mode using the 
second parametrization have a larger bias than in the case of the first parametrization. 
Unsurprisingly, turning to the Bayes factor, we do observe a clear evidence in favor 
of the results obtained under PI. In contrast, when the true distribution is defined 
from the convolution-based representation, results in Table and especially the Bayes 
factor, clearly provide support for the P2-based estimates. This parametric dependence 
is further illustrated in Eigure[^ which displays the marginal posterior distributions using 
PI and P2 when the sample size is 1000. 


[Tables [T]j^ and Eigure here] 


Second, comparing Tables and it is worth noting that the results obtained using 
PI are less sensitive to the parametrization of the underlying model than those obtained 
under P2. To understand this point, note that the parameter values of the ESN distri¬ 
bution Q are such that N is close to the boundary of the parameter space {uP ss 0.038) 
and, consequently, inference for this parameter is very sensitive to the choice of the prior 


distribution (see e.g. Newton and Raftery 1994 Gelman, 2006). In particular, the prior 
we choose for N puts a very small weight to values close to zero and therefore tends to 
overestimate N. This nearly boundary problem is critical in the sense that even ’’non 


informative” prior distributions can have a substantial effect on inference (see e.g. Gel 















man et al., 2008). For that reason, and contrary to the current practise (see e.g. Adock 


2004; Liseo and Parisi, 2013), we advocate for the use of the (^, S, ct, A)-parametrization 


to carry out parameter inference in the ESN (and in the SN) distribution. 

However, and this is our third observation, when the sample size gets larger and larger, 
posterior modes converge toward the true parameter values irrespective of the chosen 
parametrization. In particular, the middle panel of Figure [^provides strong support for 
the convergence of the marginal posterior modes when the sample size is 10 000. 

Finally, taking the low number of particles {N = 10 000), the Monte Carlo error is 
rather small in all cases and for all parameters, especially as the sample size increases. 
However, it is at the expense of a somehow large computing time which is, for both 
parametrization, around 90 seconds for re = 1 000 and around 460 seconds for re = 5 000. 

4.1.2. Model selection 


As explained in Section 3.2.3 it is critical to assess the robustness of ESN distributions 


with respect to Gaussian distributions. Therefore we conduct some simulation exper¬ 
iments regarding the Bayes factor to test the null hypothesis of normality against the 
alternative hypothesis of an extended skew-normal distribution, T5AA^'^^^(2, 6, a. A), for 
different {a, A) pairs. The results are reported in Table|^ which describes the percentage 
of samples where the evidence in favor of the ESN hypothesis is poor (log^g < 0.5), 
substantial (0.5 < logj^g-®io < 1); strong (1 < log^g Hio < 2) and decisive (log^gHio > 
2 ). 


[Table here] 

The results presented in the first three lines of Table are obtained for a sample size 
of re = 100. We observe that, despite the small number of observations, the Bayes factor 
yields very good results for (a. A) = (0,—) (i.e., Gaussian model) and (a. A) = (5,-2). 
Indeed, in both cases and in all samples, the Bayes factor selects the correct model with 
a strong confidence. For (a, A) = (0.5,0), estimations are in favour of the Gaussian 
distribution although the underlying model is ESN. This results is intuitive. Indeed, the 
Bayes factor penalizes for the number of parameters. Therefore, since A is useless when 
the underlying model is Gaussian, it is natural that the Bayes factor is biased toward 
the Gaussian distribution when a is close to zero. In contrast, when the sample size 
increases (from re = 100 to 5 000), the Bayes factor selects the correct model with a 
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strong confidence. These results suggest that the Bayes factor is convergent even if no 


formal proof for this specihc test is yet available in the literature (see Section 3.2.3). 


4.2. Example 2: Extended skew-normal sample selection model 

One key feature of the proposed methodology is its adaptability since SMC samplers can 
be used, at least from a theoretical point of view, as soon as one can evaluate efficiently 
the likelihood function. To illustrate this point in a more complicated set-up than in 
the previous subsection, we consider the estimation of a sample selection model based 
on ESN error terms. 


4.2.1. Model description 


Thanks to Dehnition the application of ESN distributions in sample selection models 
or Tobit-type models ( | Amemi^ 1986 Maddala and Lee, 1976) is a natural choice since 
any hidden truncation of normal component densities leads to such a distribution (see 


Arnold and Beaver, 2002). In this respect, starting from the Gaussian sample selection 


model (Heckman 1976), a (multivariate) extended skew-normal sample selection model 
(ESNSM) can be dehned by: 


( 6 ) 



fY? = 


\s: = , 

where B G /32 G 

, and 


^ = (^1 


Si 

S21 


S12 

1 


, a = («!, 02 ), A 


(7) 


with ^ ^ ^(A/co) IE[ej] = 0^. We assume that we observe Si = I]r_^(S*) and 

Yj = YJ'S'j, with I[yi(-) the indicator function of A C M. The likelihood function of the 
model, which is required to compute the importance weights of the SMC sampler (Al¬ 
gorithm!^, follows from a direct application of Proposition!^ (closure under conditioning 
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and marginalization of the extended skew-normal family of distributions): 


L„(0,/32,s)=n 


2=1 


^>2 (-/32Xj - ^2, 1, C2Q;2, C2A) 


$ 


(- 


C 2 A 




l-Si 


(l)d (yijSxj ^1,1:1) 


\\/l+^2 

^2{mi, (T22.2^ -“2, A -h a\{yi - Byn - ^ 1 )) 


d> 


Cl A 


l+a'i'Eiaicl 


where m* = + ( 3 ' 2 'x.i + S 12 S 7 I (y* — Bxi — $ 1 ) and with a\ and 0.2 defined as i 


m 


Proposition The prior distributions for the a. and A parameters are the same as the 
ones defined in Section 3.1, while those for S, B and P 2 discussed in Appendix [B| 


( 8 ) 


4.2.2. Simulation set-up 

The numerical study is conducted for the univariate extended skew-normal sample se¬ 
lection model: 

jy* = PiQ + PiiXii -|- eii 
\^i = /^20 + 1^22X21 + £ 22 , 

and 

{eu,e2^)^SS^fi^^^ ),(2,l),-2) (9) 

where p G {—0.9,0.3, —0.9}. The parameter value of p is a key issue in sample selection 
models. Notably when p = 0, there is no selection effect. On the other hand, it can be 
shown that the correlation between en and 622 increases with this parameter, as shown in 
FigureThe parameter values for the /3’s parameters are respectively given by /3io = 3, 
/3ii = — 2 , /I 20 = 1-5 and /322 = 2 while the covariates xu and X 2 i are assumed to be 
independent A/}(0,2) random variables (without loss of generality). This setup implies 
that Si = 0 for about 30% — 35% of the n = 1 000 observations. 

[Figure here] 

We discuss below some Monte Carlo simulations results for the extended skew-normal 
sample selection model (§-(§. The purpose of this numerical study is to compare it with 
the standard Tobit-type 2 model (i.e., the sample selection model with Gaussian errors, 
see 


Amemiya, 1986) regarding the estimation of the parameters and of the marginal 


effects. 
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4.2.3. Parameters estimation 


Table [^provides the posterior mean and the standard deviation of 50 independent estim¬ 
ates of the parameters of the model Q-([^ under the two parametric assumptions (i.e., 
the bivariate extended skew-normal and the Gaussian distribution of the error terms). 
Results are reported for the three different values of p. 


[Table here] 


Regarding the estimation of the constant and slope parameters of the regression equa¬ 
tion, /3io and /3ii, we observe that the distributional assumption has a limited effect 
on the estimated values in all scenarios. A similar result is obtained for the Student 


selection model in Marchenko and Genton (2012) and for the skew-normal model in 


Ogundimu and Hutton (2012). On the other hand, the estimation of the corresponding 


parameters in the selection equation, /32o and /322, are more sensitive to the choice of the 
error terms distribution. Indeed, if the Gaussian assumption leads to a small bias for 
these parameters when the correlation between the variable of interest and the selection 
variable is low (i.e. when p = 0.3, implying a correlation between -0.02 and -0.03), the 
results obtained with the Tobit 2 model for these parameters are significantly biased for 
larger values of |/?|. 


[Figures 3a 3b here] 


To illustrate the importance of the bias for /32o and (322, Figure [^reports, when p = 
0.9, the individual estimates of these parameters over 50 simulations in the presence of 
misspecified error terms—they are wrongly assumed to be normally distributed. Taking 
that the true parameter vector is given by (/320, /322) = (1-5, 2), we observe that all of the 
estimates of /32o and (^22 are much larger than the true underlying parameter values. To 
some extent, this result is consistent with standard results relative to the misspecification 


issues of the maximum likelihood estimator of Tobit-type models (Amemiya, 1986). 


In contrast, when the model is correctly specified, the constant and slope parameters 
of the selection process are well-estimated irrespective of the correlation parameter p. 
Notably, the posterior mean of each parameter is close to the true parameter value and 
the estimation error is small. Regarding other parameters, we obtain very good estima¬ 
tions of af and p for which we observe both a small bias and a small standard deviation. 
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The estimation of a 2 turns out to be more challenging due to the loss of information 
engendered by the censorship mechanism through S*. Moreover, the posterior mean 
of A is close to the true value at the expense of a relatively large standard deviation 
(precision), especially with respect to other parameters. 

Further evidence is provided by Figure ??, which displays the marginal posterior 
distributions of the parameters in the case of one realization of the model with 

p = 0.3. In addition to previous results, three points are worth commenting. First, the 
posterior modes are close to the true parameter values. Second, the marginal distribution 
for the /3’s parameters are very concentrated around the mode. Third, the sign of the a’s 
parameters, and hence of the skewness of the data, is well-identified since the posterior 
mass on {a* < 0,i = 1,2} is close to zero. In contrast, there is a small but significant 
posterior probability for the event {A > 0} suggesting that more observations are needed 
to identify more precisely this parameter. 


[Figures 1^ here] 

4.2.4. Marginal effects 

For ease of interpretation, it is arguably better to consider the (average) marginal effects 


(Cameron and Trivedi 2005) since only the sign (but not the magnitude) of the coeffi¬ 
cients can be readily interpreted in Tobit-type models. In this respect, we compute the 
marginal effects (see Proposition]^ in Appendix [B|) and Figure [^displays marginal effect 
estimates of (522 on E[yi*|S'j = l,Xj] for a realization of the above model with p = 0.3. 
The main result is that the Gaussian model is not able to account for substantial het¬ 
erogeneity in marginal effects. Indeed, a visual inspection shows that the distribution of 
Gaussian estimates is much more concentrated than the distribution of the true values. 
In addition, the marginal effects obtained from the Tobit type-2 models are in all cases 
larger than -10 although for a very significant proportion of individuals the marginal 
effect of (522 is indeed smaller than this threshold (with a minimum nearby -60). The av¬ 
erage marginal effect estimate under the Gaussian assumption is around -0.14 while the 
true value is about 15 times larger (around -2.08). In contrast, this estimate under the 
ESN assumption is -2.22. Some marginal effect estimates under the correct parametric 
assumption are also reported and are, as to be expected, very close to the true values. 
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5. Applications 


In this section, we proceed with two real applications. In both cases, estimations are 
performed using the Pl-parametrization (Definition [^, with the prior distribution as in 


Section 4.1 The absence of a constraint on A contrasts with most of the applications of 


skew-elliptical distributions in the literature that set the A parameter to zero (or consider 
some arbitrary value of c). Finally, the SMC sampler is initialized using a pilot run of a 
Gaussian random walk Metropolis-Hastings and the propagation step is based on r = 3 


iterations of a Gaussian Metropolis-Hastings kernel (see Section 3.2) 


5.1. Transfer fees of soccer players 

As an application of the univariate ESN, we consider a data set with 1 062 observations on 
(log-) transfer fees in major European soccer leagues. Data, which have been collected 
from various sources and are available upon request, cover the period 2008-2012 for 
the first league (England and France), Bundesliga (Germany), Calcio (Italy) and Liga 
(Spain). 

[Figures and here] 

Figure]^ (resp. Figure]^ presents the marginal posterior distributions when data are 
assumed to be randomly generated from an ESN distribution (resp. a SN distribution). 
Visual inspection of the marginal posterior distribution indicates that the ESN-based 
marginal posterior distribution of A has most of its mass on the interval (—oo, —2]. Paired 
with the fact that the posterior distribution for a has most of its mass on [1.2, oo), this 
suggests that the ESN-based specification fits better the overall distribution of data than 


the SN-based specification of Azzalini (1985). Further evidence can then be provided by 


comparing the marginal likelihood values of the two models. Notably, using the output 
of the SMC sampler, the (log-) evidence of the ESN-based model is -685.0374 whereas 
it is only -797.1437 in the case of the SN-based model. This means that the evidence in 
favor the ESN-based model can be considered as being ’’decisive” in the sense of Jeffreys] 
(1939). Einally, to assess the robustness of our results, Figure compares the ESN 
estimate of the density function of the data with a non-parametric estimate: one can 
observe that both provide very similar results. 


[Figure here] 
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5.2. Bivariate ESN: Financial Data 


As a final illustration of the proposed algorithm, we proceed with a real financial data 
Liseo and Paris! (|2013). There is an impressive literature in finance that 


set as in 


has witnessed the fact that (high-frequency) financial returns are skewed and display 
leptokurtic tails (e.g., see Jondeau et al.| 2006; [Genton 2004) and may have strong 
implications in portfolio selection, asset pricing models or risk measurement (among 
others). In this respect, we consider a simple i.i.d. bivariate sampling model. More 
specifically, we analyse the weekly returns (in percentage) of two US stocks, namely 
“ABM Industries Incorporated” (ABM) and “The Boeing Company” (BA). The sample 
size covers the period Jul 19, 1984 to Jul 28, 2014 (1566 observations) 0 


[Figures and here] 


Figures and depict, respectively, the marginal posterior distribution of each para¬ 
meter under the ESN and the SN assumption. Two points are worth commenting. First 
the contour plot of the density of the estimated SSAf 2 model suggests that raw data, 
which are skewed and fat-tailed, can be reasonably well-captured by this specification. 
Second, the marginal posterior modes of the shape (ai and 0 ( 2 ) and the shift parameter 
(A) are roughly given by 0.13, 0.20 and -3, respectively. Combined with the fact that 
the (marginal) posterior of each of these parameters has a negligible mass with posit¬ 
ive (for A) or negative (for ai, 02 ) values, the estimation provides strong support for 
the application of an extended skew-normal distribution in order to jointly model ABM 
and BA. Moreover, according to standard stylized financial facts of weekly returns, the 
location parameters, and ^ 2 ; are negative (large negative returns are more import¬ 
ant than large positive returns) and the marginal posterior modes of the unconditional 
variance-covariance parameters (af, cr^, 0 -^ 2 ) support large volatility and co-volatility. 

Finally we proceed with model selection. Using the SMC estimate of the Bayes factor, 
we find that the evidence in favor of the skew-normal bivariate distribution proposed by 


Liseo and Paris! 

(2013 

) is poor (in the sense of 

Jeffreys 

1939 


^We also perform estimation with daily and monthly returns. Our main results remain unchanged. 
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6. Conclusion 


In this paper, we propose a new Bayesian computational approach, which rests on a 
tempered sequential Monte Carlo sampler, to estimate (multivariate) extended skew- 
normal distributions. Among others, the proposed approach have several advantages. 
First, it overcomes some issues encountered in standard maximum likelihood estima¬ 
tion. Second, in contrast to MCMC methods, it is easy to build a SMC algorithm that 
is adaptive in the sense that it can adjust sequentially and automatically its sampling 
distribution to the problem at hand provided some well-defined prior distributions. Espe¬ 
cially, it can implemented for a large class of (multivariate) skew-elliptical distributions. 
Third, it allows to compute easily as a by-product the marginal posterior distributions, 
the normalizing constant and thus the Bayes factor. Fourth, it embeds as a special case 


the population algorithm provided by Liseo and Parisi (2013). 

Monte Carlo simulations provide evidence regarding the robustness of the proposed 
algorithm with different data generating processes. Irrespective of the model considered 
(sampling models, extended skew-normal sample selection models), posterior statist¬ 
ics are rather precise (with a low standard deviation) in a tractable computing time. 
Moreover, results suggest that the hidden truncation-based parametrization is more ro¬ 
bust for estimation than the convolution-based parametrization. Directions for future 


research include more comprehensive empirical applications (Gerber and Pelgrin, 2014) 
and the derivation of more general models with hidden truncation, censoring or select¬ 
ive report with the (multivariate) extended skew-normal family of distributions or some 
unified skew-elliptical distributions. 
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A. Proof of Proposition 

Let ln{0) = logLn(^) and 6 G 0* where Qe{l*) = {0 '■ Then, IniGnc) ~ 

ln{0) > 0 means that 

1 ” 

lni0G)-^ni0) - {I + a{zi - m)) + log ^ {I / Co) > 0 

i=l 

where is the log-likelihood corresponding to the Gaussian model. A sufficient condition 
for the above inequality to hold is 

log^^ I . 1 > log$(r +c(l + 2 „ + e)) 

V 1 + «g + ^)^7 

where Zn = max{| 2 ;j|}. This is equivalent to 

e + Jl* + {o'n,G + + Zn — ^n,G + c)) 

7* ^ 7* ._ _y_I_ 

^ — ^n,e • ! --- 

1 - + + 

Hence, for all e > 0, there exists a ^ such that 

- m - - log ck (/ + a{zi - m)) + log $(//co) > 0 V0 G 0,(/; J. 

2 = 1 

To prove part 2., let e and M > 1 be such that c„ := ||0„ — 0 g|| = jj where 0 = {0,1). 
Then, if 



r 

1/2 

r 

r - e 

^n,€ ^ 

e2 

<ln< f 

1 '-n 

e2 


we have \\ 0 n — 0 ^q^\\ < e so that ln{ 0 Q’‘) > ln{ 0 n)- 


B. Extended skew-normal sample selection models 

B.l. Prior distributions for S B and (32 


When available, the conjugate prior distribution is frequently used in bayesian analysis. 
Under Gaussian error terms and no selection effect, the conjugate prior distribution for 
/?! and S is the normal-inverse Wishart distribution: 


7r(/3i, K, V, V) oc exp ^-^tr(US - |(/3i - ^ 0 cn^X'X){(3^ - 
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_ ^+|/ 3 i 1+2 

X S 2 


where is a scale factor, V is a, d x d positive definite matrix, k and u are real such 
that u > \Pi\ + d Since the ESN distribution generalizes the Gaussian distribution, 
and because the presence of selection effect does not modify our prior knowledge, we 
choose this prior distribution for Pi and S. 

Using a similar argument, a possible choice of prior distribution for the parameters 
of the selection equation is P2 ~ where is a scale factor. 

This choice or prior distribution for (/?, E) is particularly convenient for model selection 
because under Gaussian error terms and no selection effect the posterior mean of Pi 
(respectively, S) has a closed form expression provided that Pi and P2 are a priori 
independent. In the numerical study (Section | 4 . 2 [ ), parameters of prior distributions are 
given by = 0 and = 5 n, with n the sample size. 


B.2. Determination of the marginal effects 


Proposition 3 . Consider the univariate extended skew-normal sample selection model 
defined by Q and Q. Let 


T{a, a, A) 


(/>(a)^(A + aa) 
$2(0, 1 , a, A) 


S(a, a. A) 


0(A/co)$ (^aco + 
^>2(0, l,a,A) 


Then, 


E [S*\Si = 1, xi] = /3'x, + T 2 i + —<52* 

C02 

E[yj*|S'i = l,Xj] = ^li +x'^i + <Ti2r2j -\-a1V2S2i 


where T2i = r (^2 + x'/ 32 , -020.2, C2A), < 52 * = <5 (^2 + x'/ 32 , -0202, C2A) and V2 = 


_ pc2a2+C2(l—p^)ai 
C02 


Proof: See Gerber and Pelgrin ( 2014 ). 


^This last condition is not necessary but ensures that all the components of E has a finite variance. 
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C. Tables 


Table 1 : Estimation of univariate £SMi distributions Q 

Estimation under PI Estimation under P2 




Mean 

Median 

Mode 

Mean 

Median 

Mode 



-15% 

-13% 

-11.5% 

-73.5% 

-73% 

-72.5% 

5 = 2 


(0.010) 

-8% 

(0.009) 

-7.5% 

(0.012) 

-7% 

(0.008) 

-31% 

(0.009) 

-30.5% 

(0.012) 

-30% 



(0.002) 

(0.003) 

(0.005) 

(0.003) 

(0.004) 

(0.006) 



-2.3% 

-3.5% 

-4.3% 

25.3% 

24.7% 

24.2% 

(t2 = 6 


(0.017) 

3.7% 

(0.016) 

3.4% 

(0.019) 

3.2% 

(0.011) 

15.5% 

(0.013) 

15.3% 

(0.016) 

15.2% 



(0.004) 

(0.005) 

(0.009) 

(0.005) 

(0.006) 

(0.010) 



-19.8% 

-20.2% 

-20.4% 

-39.2% 

-39.8% 

-40.0% 

0 = 5 


(0.006) 

-5.8% 

(0.006) 

-6.0% 

(0.010) 

-6.2% 

(0.006) 

-19.4% 

(0.005) 

-19.6% 

(0.007) 

-19.6% 



(0.003) 

(0.004) 

(0.006) 

(0.002) 

(0.003) 

(0.005) 



55% 

47% 

42% 

212% 

207.5% 

204% 

A = -2 


(0.040) 

38% 

(0.036) 

35.5% 

(0.041) 

34% 

(0.023) 

118% 

(0.024) 

116.5% 

(0.029) 

115% 



(0.011) 

(0.014) 

(0.021) 

(0.011) 

(0.013) 

(0.022) 




-1473.84 



-1532.98 


logm(2:i; 

d 


(38.37) 

-8065.57 



(34.00) 

-8074.10 





(13.01) 



(18.22) 


Time (in 

seconds) 


60.22 

120.44 


34.05 

124.26 


Notes: The results are obtained from 50 estimations of the model with Ai = 10 000 particles. Mean 
estimates are reported as percentage deviation of the true parameter value, and standard deviations are 
given in brackets. For each parameter, the first (respectively, last) two rows correspond to n =1 000 
(respectively, n =5 000). 
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Table 2: Estimation of univariate SSNi distributions 

Estimation under PI Estimation under P2 



Mean 

Median 

Mode 

Mean 

Median 

Mode 


76% 

83% 

88.5% 

-10.5% 

-9% 

-7% 

C = 2 

(0.031) 

(0.030) 

(0.035) 

(0.017) 

(0.019) 

(0.029) 

50.5% 

53.5% 

55.5% 

26% 

27% 

28% 


(0.010) 

(0.012) 

(0.022) 

(0.010) 

(0.010) 

(0.014) 


-12.7% 

-14.1% 

-15.2% 

6.8% 

6% 

5.4% 

II 

b 

(0.092) 

(0.082) 

(0.094) 

(0.045) 

(0.051) 

(0.071) 

-4.4% 

-5% 

-5.5% 

1.1% 

0.9% 

0.6% 


(0.029) 

(0.037) 

(0.055) 

(0.024) 

(0.030) 

(0.046) 


2% 

2% 

2% 

2% 

-2% 

0% 

a = 0.98 

(0.001) 

(0.001) 

(0.001) 

(0.001) 

(0.001) 

(0.002) 

3.1% 

4.1% 

4.1% 

4.1% 

4.1% 

4.1% 


(0.000) 

(0.000) 

(0.001) 

(0.001) 

(0.000) 

(0.001) 


-40.2% 

-43.1% 

45.8% 

6.1% 

4.4% 

3.2% 

A = -4.08 

(0.032) 

-25.2% 

(0.031) 

-26.5% 

(0.034) 

-27.5% 

(0.019) 

-12.5% 

(0.019) 

-13% 

(0.033) 

-13.2% 


(0.010) 

(0.013) 

(0.021) 

(0.009) 

(0.010) 

(0.018) 



-2 244.49 



-2 184.73 


logm( 2 i,„) 


(41.69) 

-11500.13 



(30.01) 

-11476.46 




(25.22) 



(41.65) 


Time (in seconds) 


38.52 



61.40 



125.98 



165.89 



Notes: The results are obtained from 50 estimations of the model with = 10 000 particles. Mean 
estimates are reported as percentage deviation of the true parameter value, and standard deviations are 
given in brackets. For each parameter, the first (respectively, last) two rows correspond to n =1 000 
(respectively, n =5 000). 


Table 3: Bayes factors 


(a. A) 

logio Bio < 0.5 

0.5 < logio -^10 < 1 

1 < logio -^10 < 2 

logio Bio > 2 

o 

o 

t-H 

II 

100% 

0% 

0% 

0% 

(5,-2) 

1% 

1% 

4% 

96% 

(0.5,1) 

100% 

0% 

0% 

0% 

11=5 000 (0.5,1) 

0% 

0% 

0% 

100% 


Notes: The results are obtained from 100 samples. The number of particles is 10 000 and Bio denotes 
the Bayes factor to test the normality hypothesis. 
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Table 4: Estimation of sample selection model (§-(§ 


Parameter 

P 


Tobit 2 


ESNM 

True value 

Mean 

Standard deviation 

Mean 

Standard deviation 


0.3 

2.92 

0.0008 

2.94 

0.0006 


/3io 

0.9 

2.98 

0.0006 

2.99 

0.0005 

3 


-0.9 

2.97 

0.0005 

2.98 

0.0006 



0.3 

-1.98 

0.0004 

-1.96 

0.0004 


Ai 

0.9 

-1.99 

0.0004 

-1.99 

0.0003 

-2 


-0.9 

-1.99 

0.0003 

-1.990 

0.0352 



0.3 

1.58 

0.0010 

1.37 

0.0015 


P20 

0.9 

2.57 

0.0020 

1.78 

0.0021 

1.5 


-0.9 

2.10 

0.0015 

1.43 

0.0020 



0.3 

2.04 

0.0013 

1.77 

0.0020 


^22 

0.9 

3.32 

0.0026 

2.30 

0.0026 

2 


-0.9 

2.77 

0.0020 

1.87 

0.0028 



0.3 

2.22 

0.0012 

6.04 

0.0101 



0.9 

2.10 

0.0011 

5.74 

0.0074 

(6) 


-0.9 

1.60 

0.0008 

6.08 

0.0113 



0.3 

0.06 

0.0010 

0.39 

0.0015 


P 

0.9 

0.63 

0.0010 

0.82 

0.0006 



-0.9 

-0.76 

0.0008 

-0.90 

0.0004 



0.3 

- 

- 

3.04 

0.0154 



0.9 

- 

- 

3.27 

0.0165 

(2) 


-0.9 

- 

- 

1.86 

0.0066 



0.3 

- 

- 

2.17 

0.0145 


02 

0.9 

- 

- 

2.15 

0.0251 

(1) 


-0.9 

- 

- 

0.55 

0.0134 



0.3 

- 

- 

-2.54 

0.0234 


A 

0.9 

- 

- 

-1.85 

0.0207 

(-2) 


-0.9 

- 

- 

-3.38 

0.0164 



0.3 

-1448.91 

0.00251 

-1313.91 

0.0269 


logm(2;i:„) 

0.9 

-1358.97 

0.00401 

-1206.92 

0.0504 



-0.9 

-1291.19 

0.0059 

-1168.48 

0.0262 



Note: Using = 10 000 particles, results are obtained from independent 50 independent 
estimations. 


29 


















30 



D. Figures 


£SJ\f I distribution 
n=1000 


n=10 000 


£SMi distribution Q 
n=l 000 















Figure 1: Marginal posterior distributions for the parameters of the £SNi distributions 
© and ([^. The results for PI (respectively, P2) are in dark (respectively, in 
grey) and are obtained with N = 10 000 particles. 
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Figure 2: Contour plots of the zero mean £SJ\f 2 distribution ([^. The left, middle and 
right panels represent the contour plots of the zero mean £SN 2 distribution 
when the correlation parameter rho is respectively given by -0.9, 0.3, and 0.9. 


3.315 


.| 3.310 


3.305- , * 

2.5600 2.5625 2.5650 2.5675 2.5700 

estimated values of Pgo 

(a) The results are obtained using N = 10 000 
particles, 50 independent estimates of the 
selection parameters (/920,/?22) are repor¬ 
ted when the true parameter vector is 
(d20,/322) = (1.5,2) and p = 0.9. 



£-40- dS’ 


-60- 

-60 -40 -20 6 

True marginal effects 

o ESNSM ^ Tobit 2 

(b) The results are obtained with N = 50 000 
particle and p = 0.3. For each individual, 
we report the marginal effect estimate us¬ 
ing either the Tobit type-2 model (triangu¬ 
lar markers) or the ESNSM model (circular 
markers) against the true marginal effect. 


Figure 3: Bias for selection coefficients of a Tobit type-2 model (Figure [3^ and marginal 
effects for the ESNSM (|^-([^ (Figure |^. 
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Figure 4: Marginal Posterior distribution of the ESNSM Q-Q, evaluated with 50 000 
particles when p = 0.3. 
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Figure 5: Transfer fees of soccer players and marginal posterior distributions of the SN 
parameters. The results are obtained with N = 50 000 particles. 



Figure 6: Transfer fees of soccer players and marginal posterior distributions of the SN 
parameters. The results are obtained with N = 50 000 particles. 


0.8 



Figure 7: Estimates of the Non-parametric and ESN transfer fees distribution. The 
dashed line (respectively, solid line) represents the estimate of the ESN (re¬ 
spectively, non-parametric) transfer fees distribution. The ESN estimate is 
obtained with N = 50 000 particles. 
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Figure 8: Marginal Posterior distributions for the financial data under the ESN assump¬ 
tion. The results are obtained with N = 10 000 particles. Evidence (in log) is 
-8 631.379. 

(! cf 









Eigure 9: Marginal Posterior distributions for the financial data under the SN assump¬ 
tion. The results are obtained for N = 50 000. Evidence (in log) is -8 647.239. 
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